Abstract



We present a method for computing the 3D motion of articulated models from 
2D correspondences. An iterative batch algorithm is proposed which estimates the 
maximum a posteriori trajectory based on the 2D measurements subject to a number of
constraints. These include (i) kinematic constraints based on a 3D kinematic model, 
(ii) joint angle limits, (iii) dynamic smoothing and (iv) 3D key frames which can be 
specified by the user. The framework handles any variation in the number of constraints as
well as partial or missing data. This method is shown to obtain favorable reconstruction 
results on a movie dance sequence. 
1 Introduction

Video is the primary archival source for human movement, with examples ranging from 
sports coverage of Olympic events to dance routines in Hollywood movies. The ability 
to reliably track the human figure in movie footage would unlock a large, untapped 
repository of motion data. However, the recovery of 3D figure motion from a single 
sequence of uncalibrated video images is a challenging problem. 

In contrast to the single-view case, 3D tracking of articulated objects from multiple 
camera views has been addressed by a number of authors [12, 14, 10, 1]. 3D kinematic 
models are used in these works to define the state space and constrain image measure- 
ments. Tracking proceeds by differentially adjusting the state using a nonlinear least 
squares (NLS) algorithm, until the projections of the kinematic model are aligned with 
the feature measurements in all of the images. The basic framework is described in 
[14]. 

In the multi-view case the tracker is solving two problems simultaneously: regis- 
tering the model with all of the images, and reconstructing the 3D pose of the figure 
from 2D measurements. With an adequate set of viewpoints, the state space is fully ob- 
servable and the state estimate will remain near the correct answer. In this case, the 3D 
kinematics provide a powerful constraint on image motion, simplifying the registration 
task. 

When only a single camera viewpoint is available, however, ambiguities can arise 
in the reconstruction of 3D pose under orthographic projection. The standard reflective 
ambiguity results in a pair of solutions for the rotation of a single link out of the image 
plane [16]. In addition, kinematic singularities arise when the out-of-plane rotation is
zero [11]. The resulting loss of rank in the kinematic Jacobian complicates the use of 
NLS tracking algorithms. 

One solution, proposed in [11], is to decouple the registration and reconstruction 
stages. Motion of the figure in the image plane is described by a 2D Scaled-Prismatic 
model (described later in section 7.1). Tracking with the SPM model registers the 
figure with the image sequence, but defers the inference of 3D pose. The 3D recon- 
struction problem can then be formulated as a batch optimization over a series of SPM 
measurements. Within a batch formulation, it is easy to combine the entire set of im- 
age measurements with additional constraints in order to resolve ambiguities in 3D 
reconstruction. 

The paper describes a batch framework for 3D reconstruction using SPM measure- 
ments. In addition to kinematic constraints, we explore the use of three other types 
of constraints: dynamic models, joint angle limits, and 3D key frames, in resolving 
ambiguities in the estimated 3D pose of the figure. This paper makes a number of con- 
tributions: (1) It characterizes the space of ambiguous 3D solutions associated with a 
set of 2D SPM measurements. (2) It contains a complete derivation of the equations 
for batch 3D reconstruction using SPM measurements and additional constraints. (3) It 
presents experimental results on the reconstruction of 3D motion from a Fred Astaire 
dance sequence. 

1.1 Problems in 3D Tracking 

There are two sources of ambiguity which make 3D figure tracking a difficult problem: 

1. Ambiguity in determining 2D model-to-image correspondences.

2. Ambiguity in reconstructing 3D pose. 

The 2D correspondence ambiguity arises from a variety of sources including back- 
ground clutter and imperfect features in the image. For example, optic-flow estimation 
based on image gradients may fail when image motion is large, because of the nonlin- 
earity of image structure. 

Additionally, correspondence data alone from a single view is insufficient for 3D 
reconstruction in that it does not encode 3D depth or orientation. Consider the prob- 
lem of inferring the pose of a 3D articulated chain in a calibrated orthographic view. 
The projection of each link is consistent with two possible 3D poses because of re- 
flective ambiguity. This problem is not easily solved by simply maintaining multiple 
solutions [16] as the number of possible solutions grows exponentially with additional 
links. 
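
The exponential growth in the number of solutions can be illustrated with a minimal Python sketch. It assumes a calibrated orthographic view, known link lengths and noise-free joint projections; the function name `candidate_depth_chains` and the example values are hypothetical and are not part of the method described here.

```python
import itertools
import math

def candidate_depth_chains(joints_2d, link_lengths):
    """Enumerate per-link depth offsets consistent with 2D joint positions.

    joints_2d: list of (x, y) image positions for the n+1 joints of a chain.
    link_lengths: list of the n known 3D link lengths.
    Returns a list of tuples (dz_1, ..., dz_n), one per sign combination.
    """
    magnitudes = []
    for (x0, y0), (x1, y1), length in zip(joints_2d, joints_2d[1:], link_lengths):
        d2 = (x1 - x0) ** 2 + (y1 - y0) ** 2
        # Orthographic foreshortening: length^2 = projected_length^2 + dz^2.
        magnitudes.append(math.sqrt(max(length ** 2 - d2, 0.0)))
    # Each link may tilt toward or away from the camera (reflective ambiguity).
    return [tuple(s * m for s, m in zip(signs, magnitudes))
            for signs in itertools.product((+1, -1), repeat=len(magnitudes))]

# Hypothetical three-link chain with unit link lengths.
joints = [(0.0, 0.0), (0.6, 0.0), (0.6, 0.8), (1.2, 0.8)]
chains = candidate_depth_chains(joints, [1.0, 1.0, 1.0])
print(len(chains))  # 2**3 candidate poses
```

Every additional link doubles the number of candidates, which is why enumerating and maintaining all of them quickly becomes impractical for a full figure model.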

Note that these two sources of ambiguity are significantly nonlinear. In particular, 
solving 2D correspondence ambiguity has to be cast as a search problem, i.e. there is 
no known formulaic function which computes correspondences from a vector of image 
pixels. For the problem of solving 3D reconstruction, such a direct computation may 
well exist but would contain numerous ambiguities; the problem in this case is selecting 
the correct solution from a list of candidates.

Most of the previous work for tracking a figure in a video sequence involves an 
online linear framework for tracking a 3D kinematic model directly from image data. 
However, it is clear from the list of ambiguities above that the problem is significantly 
nonlinear, and the performance of such tracking methods is therefore highly dependent 
on how well the problem can be linearized. We believe that the interaction of the
nonlinearities in the two ambiguities causes the problem to be difficult to linearize 
directly in 3D state-space. Furthermore, the search efficiency for image features from 
3D state-space becomes increasingly poor in the vicinity of kinematic singularities. 

This paper then takes the same view as in [11] that tracking should be separated into
the two processes of (i) 2D registration, and (ii) 3D motion recovery to properly cope 
with the nonlinear ambiguities. In this paper, only the 3D motion recovery problem is 
considered, and 2D correspondences are assumed to be available (section 7.1 discusses 
a method for obtaining these automatically). 

2 3D Motion Recovery 

The problem of 3D motion recovery can be approached from a signal reconstruction 
perspective. Suppose the 3D state for an articulated structure is represented by the con- 
catenation of all 3D pose parameters for all links in the structure. Next consider a set of 
such states representing the 'true' 3D motion sequence of the structure. The observed 
states are the result of each true state undergoing successive degradation through one 
to three lossy channels (see figure 1): 
1. Noise. Noise is added to the true 3D states.

2. Projection. Additionally, depth information is removed from some states through 
perspective projection. 

3. Deletion. Partial or full data of some states are deleted, e.g. in the case of partial 
or full occlusion or dropped frames. 

The goal is therefore to recover the true states from the set of available observed states. 
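
The three channels can be simulated with a short Python sketch. This is only an illustration under simplifying assumptions: each state is a single 3D point, projection is modelled by dropping the depth coordinate rather than by a true perspective camera, and the function name `degrade` and its probabilities are hypothetical.

```python
import random

def degrade(true_states, noise_std=0.01, project_prob=0.7, delete_prob=0.2, seed=0):
    """Pass 3D states through the noise, projection and deletion channels.

    true_states: list of (x, y, z) tuples, one per time frame.
    Returns one entry per frame: a 3-tuple (noisy 3D state), a 2-tuple
    (projected state, depth lost) or None (deleted state).
    """
    rng = random.Random(seed)
    observed = []
    for state in true_states:
        # 1. Noise channel: applied to every state.
        obs = tuple(v + rng.gauss(0.0, noise_std) for v in state)
        # 2. Projection channel: some states lose their depth component.
        if rng.random() < project_prob:
            obs = obs[:2]
        # 3. Deletion channel: some states are dropped entirely.
        if rng.random() < delete_prob:
            obs = None
        observed.append(obs)
    return observed

observations = degrade([(0.1 * t, 0.0, 1.0) for t in range(10)])
```

Setting `project_prob=0` corresponds to noisy motion capture data with dropped frames, while `project_prob=1` corresponds to purely 2D measurements from video.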







Figure 1: A channel-based model of the data degradation process, (a) shows the true 
3D trajectory with discrete states, (b) noise is added to the states, (c) some states are 
projected onto the image plane, losing depth information, (d) shows the final set of 
observed states after deletion of more states. The goal is to recover the true states from 
the set of available observed states. 

The separation of these degradation channels is useful for formulating a unified 
framework for seamlessly handling a large range of scenarios with different data degra- 
dation - e.g. from smoothing of noisy motion capture data with dropped frames, to 
estimating 3D figure motion from a video sequence with multiple occlusion events. 

3 Constraints for 3D Inference 



The difficult problem of interest is that of inferring the 3D states from 2D measure- 
ments (projection channel degradation), which is inherently ill-posed. In order to reg- 
ularize this problem, we need to utilize a number of constraints, which are 

• Kinematic constraints 

• Joint angle limits 

• Dynamic smoothing 

• 3D key frames 

The most important constraints are the kinematic constraints. These enforce
connectivity between adjacent links and constancy of link lengths, and restrict joint
motion to rotation about a fixed local axis in the case of revolute joints. They are
hard constraints and are automatically enforced when estimation is done in the
state-space of a 3D kinematic model.

Of particular note is that simply applying a 3D kinematic model to 2D measure- 
ments restricts the solution to a number of isolated candidate regions in the kinematic 
state-space (modulo depth of base link under orthography). These correspond to the 
discrete combinations of 3D reflective ambiguities at each link (see figure 2) mentioned 
in section 1.1. A candidate solution can be computed in each region using an iterative
linear algorithm. 



Figure 2: 3D reflective ambiguity. The figure shows a revolute link which can rotate 
in a full circle. From the camera position shown, it is impossible to distinguish pose A 
from pose B based only on the link projection. 

However as mentioned earlier, a multiple-hypothesis or Viterbi-decoding scheme is 
generally infeasible because the number of such regions increases exponentially with 
the number of links. Hence the constraints listed below are also required to bias the 
solution to the correct candidate region using an iterative scheme. 

Joint angle constraints specify the limits to which joints can rotate about their cor- 
responding axes. Dynamic smoothing models describe the probability of a particular 
state based on past and future states, and are generally used to bias continuous rather 
than abrupt 3D motion. 3D key frames are 3D states which are interactively estab- 
lished by the user, and are equivalent to observed states undergoing only noise channel 
degradation. The formulation of these constraints is further described below. In order 
to take advantage of key frames and smoothing dynamics within the sequence, the 3D 
estimation is done in a batch framework described in section 5. 

Note that these constraints are not limited to 3D inference from 2D measurements. 
They can also be applied to 3D measurements and partial data (noise and deletion 
channel degradation). Further note that while these constraints can be applied as in- 
trinsic hard constraints which must be satisfied, it is more flexible to set them as soft 
constraints. These would affect the estimation by adding components to the overall 
residual equation. 



4 Measurements and Constraints on Kinematic States

In the following sections we describe the measurements and constraints used in our 
batch estimation framework. In particular, we will derive residual equations which can 
be expressed in the form of 

$G(q) = w + e$

where $q$ is the kinematic state to be computed, $G(\cdot)$ is a linearizable function, $w$ is a
measurement vector, and $e$ is the residual vector which is to be minimized. These equations
can then be combined and solved simultaneously as described in
section 5.
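
To make the role of linearization explicit, the residual form can be expanded to first order about a current estimate. The symbols $q_0$ (current estimate), $\delta q$ (update) and $J$ (Jacobian of $G$) are notation introduced here for illustration:

```latex
G(q_0 + \delta q) \approx G(q_0) + J\,\delta q,
\qquad J = \left.\frac{\partial G}{\partial q}\right|_{q_0}
\quad\Longrightarrow\quad
J\,\delta q = \bigl(w - G(q_0)\bigr) + e .
```

Each measurement or constraint therefore contributes rows of $J$ and a corresponding right-hand side, and the update $\delta q$ is chosen to minimize the residual $e$ in the least-squares sense.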

4.1 2D Measurements 

As mentioned in section 1, the image plane projections of links in an articulated struc- 
ture can be recovered from image data using a 2D SPM tracker. This allows us to 
express 2D measurements at a more abstract level than raw pixel data. 

In our framework, the image positions of the joints are used. The 2D measurement 
can then be expressed in the estimation framework as 

$P\,x_j(q_t) = x_{jt} + e_{jt}$    (1)

where $q_t$ is the kinematic state for the $t$-th time frame, $x_j(\cdot)$ is the forward kinematic
function for computing the 3D position of the $j$-th joint, $P$ is the image plane projection
matrix, $x_{jt}$ is the observed image position of the joint, and $e_{jt}$ is the measurement
noise. Since $x_j(\cdot)$ is nonlinear, (1) has to be relinearized at each iteration.
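
As an illustration of the 2D measurement residual, the following Python sketch evaluates it for a toy open chain. It is not the kinematic model used in our experiments: each link is parameterized by an absolute in-plane angle and an absolute out-of-plane angle (an assumption made for brevity), the projection is orthographic, and the function names are hypothetical.

```python
import math

def joint_positions_3d(q, lengths):
    """Forward kinematics for a toy chain.

    q: flat list [phi_1, psi_1, phi_2, psi_2, ...] holding, for each link,
       an in-plane angle phi and an out-of-plane angle psi (both absolute).
    Returns the 3D position of the distal joint of every link.
    """
    x = y = z = 0.0
    joints = []
    for i, length in enumerate(lengths):
        phi, psi = q[2 * i], q[2 * i + 1]
        x += length * math.cos(psi) * math.cos(phi)
        y += length * math.cos(psi) * math.sin(phi)
        z += length * math.sin(psi)
        joints.append((x, y, z))
    return joints

def measurement_residuals(q, lengths, observed_2d):
    """Projected joint position minus observed position, orthographic P (drop z)."""
    residuals = []
    for (x, y, _z), (u, v) in zip(joint_positions_3d(q, lengths), observed_2d):
        residuals.extend([x - u, y - v])
    return residuals

lengths = [1.0, 1.0]
observed = [(0.5, 0.0), (0.5, 1.0)]
q_a = [0.0, math.pi / 3, math.pi / 2, 0.0]    # first link tilted toward the camera
q_b = [0.0, -math.pi / 3, math.pi / 2, 0.0]   # first link tilted away from it
res_a = measurement_residuals(q_a, lengths, observed)
res_b = measurement_residuals(q_b, lengths, observed)
```

Both `q_a` and `q_b` give residuals that vanish up to rounding, which shows that the 2D measurements alone cannot separate the two reflected poses.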

4.2 Joint Angle Constraints 

One way to limit our solution for 3D motion to a physically valid result is to incorporate
limits on the range of joint angles for our kinematic model. For example, a
human elbow can only rotate through about 135 degrees; it is advantageous to use this
knowledge to obtain a plausible solution for 3D motion. As shown in figure 3,
knowing the forbidden interval for the joint angle of the link allows unambiguous
selection of pose B.

Figure 3: Disambiguation from joint angle limits. The joint angle limits prevent the
selection of pose A, leaving pose B as the only possibility.

To incorporate limits on the range of revolute joint angles, we introduce inequality
constraints such as $q_j \geq \theta_j$, where $q_j$ is the $j$-th revolute angle parameter and $\theta_j$ is the
fixed lower limit for $q_j$. One method of implementing this is to use a slack variable $\lambda_j$ to
obtain an equation $q_j - \theta_j - \lambda_j = 0$, as described in [5, 6]. This method is preferred to
applying traditional differentiable barrier functions in the region of the limit, as it does
not discourage the joint angle from reaching and staying at the limit. This typically
occurs in human motion, e.g. when limbs are straightened.

However, for each limit specified, an additional variable is introduced into the estimation.
This causes a significant rise in computation time as the state-space dimension
will increase approximately threefold. Further analysis shows that an almost identical
effect can be created by using a first-order discontinuous residual function $e(q_j)$ of the
form

$e(q_j) = \begin{cases} 0, & q_j \geq \theta_j \\ k\,(q_j - \theta_j), & q_j < \theta_j \end{cases}$    (2)

where $k$ is a constant factor determining the strength of the inequality constraint. $e(q_j)$
is added to the overall residual equation for minimization, and the regime in effect is
determined according to the value of $q_j$ at each iteration. Because this barrier function
remains zeroth-order continuous, there is negligible solution instability at the limits.
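
A residual with these properties can be sketched in a few lines of Python. The sketch assumes a lower limit only, a residual that is zero inside the allowed range and one that grows linearly with a hypothetical strength factor `k` outside it; the function name and default value are illustrative.

```python
def joint_limit_residual(q_j, theta_j, k=10.0):
    """Residual and its derivative for a lower joint angle limit.

    Returns (e, de_dq). The residual is continuous at the limit, but its
    derivative jumps from 0 to k there (first-order discontinuity).
    """
    if q_j >= theta_j:
        # Limit satisfied: the constraint contributes nothing.
        return 0.0, 0.0
    # Limit violated: linear penalty pulling the angle back to the limit.
    return k * (q_j - theta_j), k

inside = joint_limit_residual(0.5, 0.0)
at_limit = joint_limit_residual(0.0, 0.0)
outside = joint_limit_residual(-0.1, 0.0)
```

Because the residual is exactly zero on and inside the limit, a joint resting at its limit (for example a straightened limb) is not pushed away from it, unlike with a smooth barrier function.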



4.3 Dynamics 

Dynamics are used to express the greater probability for a particular form of motion, 
e.g. the typical preference for smooth continuous motion compared to abrupt motion 
(see figure 4). There are many different variants of dynamic models, ranging from sim- 
ple hand-constructed constant velocity models to complex switching models automat- 
ically learned from data [13]. The typical application of dynamic models in tracking 
is for forward prediction in the context of the Fokker-Planck drift-diffusion. However, 
dynamic models also can be expressed in an interpolating, or smoothing manner. This 
is particularly useful in a batch framework where the estimation of states in all time 
frames is done simultaneously. 

For general linear models, we can express the dynamic constraints as

$Q - AQ = 0 + e_d$    (3)

where

$|I - A| = 0, \quad A \neq I$

Here $Q$ is the vector of concatenated states for all time frames and $e_d$ is the dynamics
process noise; $A$ is the matrix of dynamic coefficients, which must satisfy the stated
conditions. Equation (3) represents the dynamics component to be added to the overall
residual equation.

In our experiments, a second order constant velocity model was used. In this case,
the predicted current state $q_t = (q_{t-1} + q_{t+1})/2 + e_{dt}$ is the mean of the immediate past
and future states.
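
The smoothing residual of this constant velocity model can be sketched in Python as follows. The example uses scalar states, and the function name and the two toy trajectories are hypothetical.

```python
def smoothing_residuals(trajectory):
    """Residual q_t - (q_{t-1} + q_{t+1}) / 2 for every interior frame."""
    return [trajectory[t] - 0.5 * (trajectory[t - 1] + trajectory[t + 1])
            for t in range(1, len(trajectory) - 1)]

# A smooth out-of-plane angle trajectory, and the same trajectory with the
# middle frame replaced by its reflected (sign-flipped) alternative.
smooth = smoothing_residuals([0.4, 0.5, 0.6])
flipped = smoothing_residuals([0.4, -0.5, 0.6])
```

The residual is close to zero for the smooth trajectory and close to -1 for the flipped one, so the dynamics term penalizes the reflected pose even though both agree with the 2D measurements.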




Figure 4: Disambiguation from dynamics. Knowing the approximate poses of the joint 
at frames t—1 and t+1 preferentially selects pose B at frame t when dynamics is used 
to bias towards smooth motion. 

The use of even simple dynamic prediction significantly helps in eliminating incor- 
rect sets of hypotheses due to 3D reflective ambiguities. While more accurate learned 
models are preferred if available, they unfortunately require vast amounts of training 
data for modeling such that intra-class and inter-class variations are captured. This 
poses a problem for learning 3D human motion models because of the difficulty of obtaining
a large volume of data. In [8], a probabilistic model was derived from limited motion- 
capture data. However, it is difficult to imagine such a model will generalize well, 
especially considering that most motion-capture systems currently require individuals 
to don cumbersome apparatus which hinder the naturalness of the motion. 

4.4 3D Key Frames 

Despite the application of kinematics, joint angle limits and dynamic smoothing, 3D
motion recovery is generally still underconstrained. Instead of attempting to solve the
hard problem of using additional constraints such as shading cues, an intermediate
solution is to rely on manual assistance and allow an interactive user to set 3D key
frames (see figure 5).




Figure 5: Disambiguation from key frames. A key frame would specify the approx- 
imate pose of the joint, which in this case is located near pose B. Hence pose B is 
selected. Note that the key frame does not need to be exact. 






Since these 3D key frames are inherently noisy, we treat these as noise channel
degraded observations:

$q_t = q^*_t + e_{kt}$    (4)

where $q^*_t$ is a key frame observed at frame $t$, and $e_{kt}$ is the measurement noise of the
key frame.

For greater generality, we also allow the specification of partial key frames, in 
which only some state parameters are established. For example this may be used to 
disambiguate the angles of one joint in a human figure model if this is the only am- 
biguous limb. In the context of (4), the unestablished state parameters will have infinite 
variance. 
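
One way to realize partial key frames in practice is through inverse-variance weights, where an infinite variance becomes a zero weight. The Python sketch below assumes a single noise level `sigma` for all established parameters and marks unestablished parameters with `None`; these conventions and the function name are hypothetical.

```python
def key_frame_rows(q_t, key_frame, sigma=0.1):
    """Weighted residual rows for a (possibly partial) 3D key frame.

    q_t: current state estimate at the key frame's time frame.
    key_frame: list with a float for each established parameter and None
               for each unestablished one.
    Returns a list of (residual, inverse_variance) pairs.
    """
    rows = []
    for value, target in zip(q_t, key_frame):
        if target is None:
            # Infinite variance: the row carries zero weight in the solve.
            rows.append((0.0, 0.0))
        else:
            rows.append((value - target, 1.0 / sigma ** 2))
    return rows

rows = key_frame_rows([0.2, 1.0], [0.0, None], sigma=0.5)
```

Here only the first parameter is constrained by the key frame; the second is left entirely to the 2D measurements and the other constraints.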

In an interactive setting, the user will initially apply the solver with a minimal
number of key frames, e.g. at the start and the end of the sequence and at potentially
problematic frames that depart from the expected dynamics. Any resulting gross
estimation errors may be corrected by introducing additional key frames and reapplying
the solver.



5 3D Batch Framework 

Our 3D batch framework uses iterative least squares to solve for the state subject to the 2D
measurements and other constraints simultaneously in all time frames. Assuming that the
noise in the measurements and constraints is Gaussian distributed, the framework
computes the maximum a posteriori (MAP) estimate

$\hat{Q} = \arg\max_{Q}\, p(Q \mid Z)$

for the full trajectory state $Q$ (consisting of the states in all time frames) given the 2D
measurements $Z$. In this case, the priors are the constraints which are applied to the
estimation.
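
Under the Gaussian assumption, maximizing the posterior is equivalent to a weighted least-squares problem. The following step is a sketch, assuming the individual measurement and constraint terms are independent; the index $i$ over residual blocks and the per-block covariances $\Sigma_i$ are notation introduced here for illustration:

```latex
\arg\max_{Q}\; p(Q \mid Z)
  \;=\; \arg\min_{Q}\; \bigl(-\log p(Q \mid Z)\bigr)
  \;=\; \arg\min_{Q}\; \sum_{i} e_i(Q)^{T}\, \Sigma_i^{-1}\, e_i(Q)
```

Each residual block $e_i$ corresponds to one 2D measurement, joint angle limit, dynamics term or key frame, so adding or removing a constraint only adds or removes a term in the sum.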

Note that our batch framework applies dynamic models differently from the standard
Kalman and forward-backward smoothing filters. The results obtained are therefore
different: the filters compute a sequence of marginal MAP states based on $p(q_t \mid Z)$,
which is not the same as the $t$-th state in the MAP trajectory computed by our framework.

Measurements and constraints are incorporated in the batch estimation by merging
all residual equations expressed in (1), (2), (3), and (4) for all time frames in the form
of

$J^T \Sigma^{-1} J \,\delta Q = J^T \Sigma^{-1} R$    (5)

where $J$ is the overall Jacobian, $R$ is the overall measurement vector, and $\Sigma$ is the
measurement covariance. The matrix $J^T \Sigma^{-1} J$ is block-diagonal and grouped according
to time frames. We further add a stabilization term $kI$ to $J^T \Sigma^{-1} J$, where $k$ is
a constant.

Note that the 2D measurements, joint angle limits, dynamics and 3D key frames
are represented as rows in $J$ and treated in the same unified manner by the framework.
This allows great flexibility, e.g. for including as many 3D key frames as required, or
even changing constraints on the fly. Handling partial or missing data simply involves
zeroing some of the entries in $\Sigma^{-1}$.

Finally, (5) is solved iteratively using the Gauss-Newton least-squares method with 
a sparse-matrix inversion routine. 
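
The overall iteration can be illustrated on a toy problem with a short Python sketch. This is not the implementation used in our experiments: the state is a single out-of-plane angle per frame for a unit-length link, the measurement is the projected link length, the weights are arbitrary, and a dense Gaussian elimination stands in for the sparse solver. All names and values are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def batch_estimate(x_obs, key_frames, iters=20, w_meas=100.0, w_dyn=10.0,
                   w_key=100.0, damping=1e-3, init=0.5):
    """Damped Gauss-Newton over all frames at once.

    x_obs[t]: observed projected length cos(q_t) of a unit link.
    key_frames: dict mapping frame index to an approximate angle.
    """
    n = len(x_obs)
    q = [init] * n
    for _ in range(iters):
        rows = []  # (sparse Jacobian row, right-hand side w - G(q), weight)
        for t in range(n):  # 2D measurement rows
            rows.append(({t: -math.sin(q[t])}, x_obs[t] - math.cos(q[t]), w_meas))
        for t in range(1, n - 1):  # constant velocity smoothing rows
            rows.append(({t - 1: -0.5, t: 1.0, t + 1: -0.5},
                         -(q[t] - 0.5 * (q[t - 1] + q[t + 1])), w_dyn))
        for t, target in key_frames.items():  # key frame rows
            rows.append(({t: 1.0}, target - q[t], w_key))
        # Accumulate the normal equations (J^T W J + k I) dq = J^T W r.
        a = [[damping if i == j else 0.0 for j in range(n)] for i in range(n)]
        b = [0.0] * n
        for jac, r, w in rows:
            for i, ji in jac.items():
                b[i] += w * ji * r
                for j, jj in jac.items():
                    a[i][j] += w * ji * jj
        dq = solve_linear(a, b)
        q = [qi + di for qi, di in zip(q, dq)]
    return q

true_q = [0.3 + 0.1 * t for t in range(6)]
x_obs = [math.cos(v) for v in true_q]
estimate = batch_estimate(x_obs, {0: 0.3, 5: 0.8})
```

In this toy setting the measurement, dynamics and key frame rows are all stacked into the same system, so adding another key frame or dropping a measurement only changes the list of rows, not the solver.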

6 Results 

One of our key goals is to recover 3D human motion from video footage, and in this 
paper we present results from a Fred Astaire dance sequence. The top row of figure 6 
shows four frames from the test sequence, superimposed with 2D SPM measurements. 
These 2D SPM measurements have to be manually specified as none of the current 
trackers are able to track successfully from video when there is significant 3D body 
rotation, which occurs in our test sequence. These 2D measurements for the 14-frame 
sequence are used as input into our 3D estimation framework. Two 3D key frames 
are manually specified at the start and end of the sequence. The estimation involved 
running the algorithm for 20 iterations taking a total of 27 seconds. The final output 
was imported into 3D Studio Max and rendered. 

The middle row of figure 6 shows the corresponding reconstruction of the frames 
rendered from approximately the original camera viewpoint. The bottom row shows the
same reconstruction but from a viewpoint which is up and left of the original camera 
position. 

The results reflect a highly credible reconstruction of the original 3D motion. How- 
ever, one artifact is the small inter-penetration of the legs in the middle of the sequence. 
This is because the thickness of the body parts was unaccounted for in the estimation
framework, and we currently do not place any constraints on limbs inter-penetrating. 
One other noticeable artifact is that head rotation about the spinal axis is not recovered 
and hence the facial direction cannot be reconstructed. 

7 Previous Work 

Many researchers have tackled the problem of tracking 3D articulated models with 
multiple cameras. Rehg [15] tracked hands using an extended Kalman filter framework. 
O'Rourke and Badler [12] and Gavrila and Davis [3] used multiple cameras to obtain 
3D positions of the human body, while Bregler and Malik [1] used Kalman filtering to 
exploit dynamic constraints. 

To obtain a less complete, but still useful, interpretation of motion, many researchers 
have attempted tracking in 2D from a single camera. Hogg [7] and Bregler and Ma- 
lik [1] studied the case of the human walking parallel to the image plane, which limits 
the solution to two dimensions. Hel-Or and Werman [6] applied joint constraints to 
find 2D in-plane motion, in both Kalman filter and batch solutions. Other papers al- 
low motion out of the plane of view, but only attempt to fit a 2D model to the image 
stream [9]. Morris and Rehg [11] both used 2D models with prismatic joints to do this. 
Such tracking data may be useful for classification of 3D motion, but it is inadequate 
for true 3D motion analysis. 


Figure 6: Top: Manually specified 2D SPM measurements for Fred Astaire in four 
frames. Middle: 3D state estimates produced by batch estimation algorithm, shown 
from original camera viewpoint. Bottom: 3D state estimates shown from a different 
camera viewpoint. 



Few attempts have been made to capture 3D motion from a single image stream.
Goncalves et al. [4] tracked a human arm in a very constrained environment with minimal
reflective ambiguity. Shimada et al. [16] captured hand motion from one camera,
using Kalman filtering and sampling the solution probability space. They exploited joint
constraints by truncating the probability space. The strength of joint constraints in the
hand model helped make this possible (e.g. finger joints can only rotate approximately
90 degrees). Howe et al. [8] also recovered the 3D position of a human figure, but with
limited movement out of the plane of vision and no body rotation.

7.1 2D SPM Tracking 

In order to obtain 2D joint positions from image data, we describe the figure projec- 
tion using a scaled prismatic model (SPM), introduced in [11]. The model enforces 
2D constraints on figure motion that are consistent with the underlying 3D kinematic 
model. 

Each link in a scaled prismatic model describes the image plane appearance of 
an associated rigid link in an underlying 3D kinematic chain. Each SPM link can 
rotate and translate in the image plane. The rotation degree of freedom (dof) captures 
the projected link orientation of revolute joints in the 3D model. The translation dof 
captures the foreshortening that occurs when 3D links rotate into and out of the image 
plane. The figure can then be modeled in 2D as a branched chain with arms, legs, 
and head modeled as SPM links. A complete discussion of SPM models, including a 
derivation of the SPM Jacobian and an analysis of its singularities, can be found in [11].
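
The relationship between a 3D link and its SPM description can be sketched in Python. The sketch assumes orthographic projection along the z axis and a single isolated link; the function name is hypothetical and the full SPM formulation is given in the cited reference.

```python
import math

def spm_link_from_3d(p0, p1):
    """Image-plane angle and projected length of a 3D link.

    p0, p1: (x, y, z) positions of the link's two end joints.
    The angle corresponds to the SPM rotation dof and the projected length
    to the SPM translation (foreshortening) dof.
    """
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    return math.atan2(dy, dx), math.hypot(dx, dy)

# A unit link tilted toward the camera, and its reflection tilted away.
toward = spm_link_from_3d((0.0, 0.0, 0.0), (0.6, 0.0, 0.8))
away = spm_link_from_3d((0.0, 0.0, 0.0), (0.6, 0.0, -0.8))
```

Both poses map to the same SPM parameters, which is why SPM tracking can register the figure with the image without committing to a 3D pose.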

Each SPM link is associated with a template representation of its appearance. 
Tracking then involves minimizing the difference between the projected template and 
the corresponding pixels in the image frame. The multiple-hypothesis statistical frame- 
work proposed in [2] generates a probabilistic representation for the SPM state which 
can eventually be integrated with the current 3D estimation framework. However, as 
the template representation for a torso does not cope well with out-of-plane rotation, 
automated tracking is generally limited to motion without significant torso rotation. 

8 Summary and Future Work 

We presented a method for recovering the 3D motion of articulated models from a 
sequence of 2D SPM measurements. It exploits a number of constraints including 
kinematic constraints, joint angle limits, dynamic smoothing and 3D key frames. The 
equations for these constraints were derived and integrated into a 3D batch estimation 
framework. The estimation framework is flexible and can easily cope with variation in 
the number of constraints applied, and also with partial or missing data. The goal of 
this work is to be able to apply the method to reconstructing figure motion from video 
footage. 

The favorable reconstruction results shown for a Fred Astaire dance sequence illus- 
trate the capability of using multiple constraints to reduce 3D ambiguity. However one 
problem encountered is the partial inter-penetration of limbs in the sequence. 
For the future, we intend to add further constraints to our framework. These include
volume exclusion constraints to avoid inter-penetration of links, as well as making use
of self-occlusion cues to further help disambiguate 3D pose. We also plan to enhance
the estimation framework to cope with remaining unfiltered ambiguities, possibly
using a multiple hypothesis statistical framework. Finally, we will explore ways to
fully automate the process of video to 3D figure motion recovery. This will include
the interleaving of the 3D estimation framework with 2D tracking to improve both the
robustness of 2D registration and the quality of 3D reconstruction.